By:
1. BING HONGJIAN - A0172543H
2. BRYON LIEW KAI RONG - A0168236Y
3. LIM WEE SHENG GAVIN - A0166830A
4. ZHOU KAI JING - A0171366A

1.1 Introduction to Bike Sharing Services in Singapore

image

image

Bike-sharing is a service where bicycles are made available for shared use to individuals on a short-term basis. Generally, these businesses follow three distinct business models: Docking stations, geo-fencing and free-float models.

Firstly, the docking station models only allows bicycles to be borrowed and returned from a designated docking stations. Secondly, the geo-fencing model allows bike hires to be ended within or outside of a designed virtual fence. And finally, free-floated bikes allows the users to drop off their bikes at any location within a city’s boundaries.

In Singapore, the bike sharing industry is dominated by free float bikes that are being offered by several private companies. The industry began with the entrance of a local bike sharing company, Obike, in early 2017. The entrance of Obike was the followed by Ofo and Mobike, who were both established companies in the bike-sharing market, with a huge presence in China. However, in recent months it seems that some companies have emerged as losers in the fierce competition, facing losses due to illegal parking and damage to bikes. In early 2019, Mobike had announced its plans to exit the market and as such, none of these original companies remain in Singapore.

After surveying the different bicycle parking stations located in Singapore, our team has also come to realise a separate problem: The uneven spread of shared bicycles across different the stations in Singapore. What this means is that at some stations, we observe a surplus of bicycles to docks and there are more parked bicycles than the number of docks allocated to that particular station. On the other hand, some of the stations are facing a shortage of bicycles and the docks are totally barren with no bicycles parked at all.

This report will describe our team’s project in creating an application, and display its functions in fixing the problems these bike sharing companies are facing. We will accomplish this by analysing data of bike-sharing services across different stations in San Francisco’s Bay Area in 2014. From our analysis, a solution will be provided to bike sharing companies to help them distribute the number of bikes across different stations and resolve the problem of shortages and surplus of bicycles in their stations.

1.2 The existing technology and market practices in Bike Sharing Services

Currently, shared bicycles work based on a QR code system. The QR codes are a sort of “key” that allows to bicycles to locked, unlocked and parked.

The codes must first be scanned through the company’s application, after which the bike can will be unlocked and can be used. When returning their bicycles, customers must first bring their bikes to a designated parking station and scan a QR code located at the parking zone to end their trip, park and lock their bikes.

These parking zones contain a geo-fencing system which only enables bikers to park and lock the bikes when the bikes are within the virtual perimeter (size of the parking zones) assigned by the biking companies.

All stations follow a similar design with 3 separate sizes: Small, Medium and Large. The different sized stations have different numbers of docks to accomodate parked bicycles, 5 at small stations, 8 at medium stations and 10 at large stations. These stations are located according to the expected traffic in the area, with large stations often located close to MRT stations and bus interchanges.

This current practice of locating parking stations according to the expected traffic in the area is certainly logical as areas with a high flow of possible customers will require more bikes. However, when examining the data we still observe small and medium stations with a shortage of bikes and large stations with a surplus. Our projects aims to reduce this problem.

1.3 The pain points of current market practices to consumers

Unlike other cities like London and Taipei, Singapore bike sharing businesses follow a free-floating system where bike users can park the bikes anywhere without penalties. This practice has resulted in the issue of indiscriminate parking by bike users. To tackle this problem, the Land Transport Authority (LTA) has announced new regulations for dockless shared bike operators as well as the users. As reported by the Straits Times in September 2018, LTA will be installing Quick Response (QR) codes at public bicycle parking places to ensure that bikes are parked correctly. Furthermore, errant users will be charged $5 by licensed operators each time they park these bicycles indiscriminately.

However, given the limited space for bicycles in parking stations, other problems might arise as well. We foresee that this new law may cause problems of bicycle overcrowding at destinations that are popular end points, but less popular start points. Whereas areas that are popular start points, but less popular end points might faces issues with bicycle shortages throughout the day.

Here is an image of some cases of overcrowding at bicycle stations: image

Furthermore, timing is a factor that creates bike shortage or excess. For example, at places like the CBD area and during peak hours, the demand for renting a bike is likely to be high. This causes bike shortages at stations in these area and customers may not be able to rent a bike at the station.

Under the new regulation, bike users must park their bikes at the designated QR code areas if they do not want to be fined. However, given the limited space at each station, there may be overcrowding problem as bike users squeeze and dump their bikes anywhere within the virtual fence. This will cause more incidents of indiscriminate parking.

Therefore, allowing customers to know the number of bikes available at the stations and to visualise the demand of bikes at certain stations under certain time intervals are essential in helping to deal with such problems

1.4 How will we tackle the problem? What data is used?

Initially, our group intended to use data that from local bike sharing companies that operate in Singapore such as OFO, Mobike and SGBike. They run their businesses locally and are more likely to be relevant for the purposes of this project. However, in the course of the project, taking into consideration the availability of these datasets and that most of the bicycle sharing companies such as OFO, Mobike and oBike have either exited or are planning to exit the local market, we have decided to use an alternative source of bicycle sharing data from San Francisco Bay Area.

The final dataset we have chosen to use can be obtained from this link: Data Source

2. Data Analysis

2.1 Importing packages into library

library(chron)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(ggplot2)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(ggthemes)
library(leaflet)
library(readr)
library(reshape2)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths

2.2 Importing data into Rstudio

setwd(getwd())

StationData <- read_csv("station.csv")
## Parsed with column specification:
## cols(
##   id = col_double(),
##   name = col_character(),
##   lat = col_double(),
##   long = col_double(),
##   dock_count = col_double(),
##   city = col_character(),
##   installation_date = col_character()
## )
TripData <- read_csv("trip.csv")
## Parsed with column specification:
## cols(
##   id = col_double(),
##   duration = col_double(),
##   start_date = col_character(),
##   start_station_name = col_character(),
##   start_station_id = col_double(),
##   end_date = col_character(),
##   end_station_name = col_character(),
##   end_station_id = col_double(),
##   bike_id = col_double(),
##   subscription_type = col_character(),
##   zip_code = col_double()
## )
## Warning: 11058 parsing failures.
##   row      col               expected actual       file
## 11506 zip_code no trailing characters  -2585 'trip.csv'
## 15287 zip_code no trailing characters  -2585 'trip.csv'
## 15409 zip_code no trailing characters  -2585 'trip.csv'
## 21731 zip_code no trailing characters  -2585 'trip.csv'
## 32654 zip_code no trailing characters  -2585 'trip.csv'
## ..... ........ ...................... ...... ..........
## See problems(...) for more details.
WeatherData <- read_csv("weather.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   date = col_character(),
##   precipitation_inches = col_character(),
##   events = col_character()
## )
## See spec(...) for full column specifications.

The dataset consists of 4 different csv files: station, status, weather and trips data.

In order to facilitate the process of data analysis, data cleaning was performed after the acquisition of data by merging Stations, Status, and Weather into a single dataframe. From the combined dataframe, data was aggregated on an hourly average basis for us to understand the hourly trends on the availability of the bicycles and docks data for each station. In addition, we can also draw insights on the usage rate from these two metrics.

2.3 Data cleaning

if (!("Status9" %in% ls())) {
  if (file.exists("Data_cleaned.csv")) {
    Status9 <- read_csv("Data_cleaned.csv")
  } else {
    StatusData <- read_csv("status.csv")#import bike station status data
    
    StatusData$Year <- as.numeric(substr(StatusData$time,1,4)) #extract year and  form a new column
    StatusData <- StatusData[StatusData$Year==2014,] #only analysing in year 2014
    StatusData$Date <- paste0(substr(StatusData$time,9,10),"/",substr(StatusData$time,6,7),"/",substr(StatusData$time,1,4)) #add date
    StatusData$Hour <- as.numeric(substr(StatusData$time,12,13)) #add hour column
    StatusData$Min <- as.numeric(substr(StatusData$time,15,16)) #add minute column
    
    StatusData <- StatusData[,-4] #remove original time column
    StatusData$Date <- as.Date(StatusData$Date, "%d/%m/%Y")
    Status2 <- full_join(StatusData,StationData[,c(-7)],by=c("station_id"="id"))
    
    Status3 <- Status2[,-2]
    Status4 <- Status2[,-3]
    
    Status5 <- Status3 %>%
      group_by(station_id, Date, Year, Hour, name, lat, long, dock_count, city) %>%
      summarize(Average_Docks = mean(docks_available)) 
    
    Status6 <- Status4 %>%
      group_by(station_id, Date, Year, Hour, name, lat, long, dock_count, city) %>%
      summarize(Average_Bikes = mean(bikes_available))
    
    Status7 <- full_join(Status6, Status5)
    
    Status7$Usage_Rate <- Status7$Average_Docks/(Status7$Average_Bikes + Status7$Average_Docks)
    
    Status7$Date <- as.Date(Status7$Date, "%d/%m/%Y")
    
    WeatherData$date <- as.Date(WeatherData$date, "%m/%d/%Y")
    WeatherData$Year <- as.numeric(strftime(WeatherData$date, "%Y"))
    
    WeatherData <- WeatherData[WeatherData$Year==2014,]
    Status8 <- full_join(Status7,WeatherData,by=c("Date"="date"))
    
    na_count <- sapply(Status8, function(y) sum(length(which(is.na(y)))))
    na_count <- data.frame(na_count)
    
    Status9 <- Status8[complete.cases(Status8[,c("mean_temperature_f","mean_humidity","cloud_cover")]),]
    
    names(Status9)[names(Status9) == 'Year.x'] <- 'Year'
    Status9 <- Status9[,-ncol(Status9)]
    
    write.csv(Status9, file = "Data_cleaned.csv", row.names = FALSE)
  } 
}
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Date = col_date(format = ""),
##   name = col_character(),
##   city = col_character(),
##   precipitation_inches = col_character(),
##   events = col_character()
## )
## See spec(...) for full column specifications.

Before further analysis on weather effects on bicycle usage, a correlation matrix was generated for the three selected variables of analysis: Mean Temperature, Mean Humidity, and Cloud Cover.

2.4 Correlation Matrix

cor(Status9[,c("mean_temperature_f","mean_humidity","cloud_cover")], use="complete")
##                    mean_temperature_f mean_humidity cloud_cover
## mean_temperature_f          1.0000000    -0.2254128  -0.1827273
## mean_humidity              -0.2254128     1.0000000   0.5756338
## cloud_cover                -0.1827273     0.5756338   1.0000000

From the matrix, we can assume that there is no interaction between all the independent variables since their correlation coefficients do not exceed the threshold of 0.8, hence indicating no multicollinearity.

2.5 ANOVA and multiple linear regression analysis

anova.twoway <- aov(Usage_Rate ~ mean_temperature_f + mean_humidity + cloud_cover, data = Status9)
summary(anova.twoway)
##                         Df Sum Sq Mean Sq F value   Pr(>F)    
## mean_temperature_f       1     63   63.39 1872.88  < 2e-16 ***
## mean_humidity            1      1    1.25   36.94 1.22e-09 ***
## cloud_cover              1      8    7.91  233.70  < 2e-16 ***
## Residuals          2945130  99687    0.03                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Multiple Linear Regression
MLRModel <- lm(Usage_Rate ~ mean_temperature_f + mean_humidity + cloud_cover, data=Status9)
summary(MLRModel)
## 
## Call:
## lm(formula = Usage_Rate ~ mean_temperature_f + mean_humidity + 
##     cloud_cover, data = Status9)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53555 -0.11746  0.00732  0.11838  0.49396 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.690e-01  1.335e-03  351.34   <2e-16 ***
## mean_temperature_f  6.937e-04  1.635e-05   42.42   <2e-16 ***
## mean_humidity       1.549e-04  1.141e-05   13.57   <2e-16 ***
## cloud_cover        -8.823e-04  5.772e-05  -15.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.184 on 2945130 degrees of freedom
## Multiple R-squared:  0.0007273,  Adjusted R-squared:  0.0007263 
## F-statistic: 714.5 on 3 and 2945130 DF,  p-value: < 2.2e-16

Subsequently, further ANOVA and multiple linear regression analysis were then performed on the cleaned data and we found out that there was a correlation between the Usage Rate and all the independent variables of mean temperature, mean humidity and cloud cover. Since the p-value of ANOVA analysis falls below the statistical significance level of 0.05, we can conclude that there all 3 weather variables of analysis are good indicators on usage of bicycles. Using this understanding, we can help consumers predict with greater confidence the availability of bicycles and docks at each location for their bicycle sharing service.

3. Geo-spatial Analysis

Our team has also included multiple geo-spatial mapping functions in the application that allows us to see many trends in the spatial distribution of Bicycles in San Francisco’s Bay area.

3.1 Registering API key and getting map of San Francisco

register_google(key='AIzaSyBKNZLfGl7FCDNmIhZmibexeWPBKl5XWXI')
bay_area_map<-get_map(location = c(mean(StationData$long),mean(StationData$lat)),source='google',zoom=10)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=37.590243,-122.218416&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx

3.2 Generating map

ggmap(bay_area_map)+
  geom_point(data=StationData,aes(x=long,y=lat,color=city,size=dock_count),alpha=0.5)+
  ggtitle('Bike Stations in Bay Area')+
  theme_map()

leaflet()%>%addTiles() %>%
  addMarkers(data =StationData, lng = ~long, lat = ~lat,popup = ~name,
             clusterOptions = markerClusterOptions())

From the map, we observe that the stations are mainly located at 5 different cities in the Bay Area. These cities are Mountain View, Palo Alto, Redwood City, San Francisco and San Jose.

3.3 Number of stations and docks in each city

#number of stations by city
station_counts <- count(StationData, vars = city)
colnames(station_counts) <- c("City", "Number_of_Stations")
station_counts
## # A tibble: 5 x 2
##   City          Number_of_Stations
##   <chr>                      <int>
## 1 Mountain View                  7
## 2 Palo Alto                      5
## 3 Redwood City                   7
## 4 San Francisco                 35
## 5 San Jose                      16
#number of docks by city
dock_counts <- aggregate(StationData$dock_count, by=list(city=StationData$city), FUN=sum)
colnames(dock_counts) <- c("City", "Number_of_Docks")
dock_counts
##            City Number_of_Docks
## 1 Mountain View             117
## 2     Palo Alto              75
## 3  Redwood City             115
## 4 San Francisco             665
## 5      San Jose             264
#Plotting number of docks and stations by city
dock_station_counts <- left_join(station_counts,dock_counts, by = "City")
dock_station_counts <- dock_station_counts %>% gather(key, value, -City)
ggplot(dock_station_counts, aes(x = City, y = value)) + 
  geom_bar(aes(fill = key), stat="identity", position = "dodge") + 
  ggtitle("Plot of number of Docks and Stations in a City") +
  ylab("Number of Stations")

From the bar graph above, it can be seen that San Francisco and San Jose have the most number of stations and bicycle docks in the Bay Area.

Before going fully into geo-spatial analysis of the data, we first clean the data.

3.4 Data cleaning for geo-spatial analysis

#firstly, filter trips with arrive/departure stations that are not in the 'station data set'
stations<-union(unique(TripData$end_station_name),unique(TripData$start_station_name))

setdiff(unique(TripData$start_station_name),StationData$name)
## [1] "Broadway at Main"           "San Jose Government Center"
## [3] "Post at Kearny"             "Washington at Kearny"
sta<-setdiff(unique(TripData$end_station_name),StationData$name) #identified 4 stations, remove them

tripdata<-TripData[(!TripData$start_station_name %in% sta & !TripData$end_station_name%in% sta),]

#clean the trip data, such as convert the duration from seconds to minutes 
tripdata$duration<-tripdata$duration/60
#check the pattern
ggplot(data=tripdata)+
  geom_boxplot(aes(y=duration))

#remove the outliers, top 0.1%
quantile(tripdata$duration,0.999)
##    99.9% 
## 1200.427
tripdata<-tripdata[tripdata$duration<quantile(tripdata$duration,0.999),]

#sort the stations based on their latitude so that the stations are more close to each other in geography
stationdata<-StationData[order(StationData$lat,decreasing = T),]

With the data cleaned, we now conduct a network analysis in order to view the number of trips between each each station.

trip_count<-matrix(0,ncol=length(unique(tripdata$start_station_name)),nrow=length(unique(tripdata$end_station_name)))
rownames(trip_count)<-stationdata$name
colnames(trip_count)<-stationdata$name
for(i in 1:length(tripdata$id)){
  sta=tripdata[i,'start_station_name'][[1]]
  end=tripdata[i,'end_station_name'][[1]]
  trip_count[sta,end]=trip_count[sta,end]+1
}

We then plotted a heat map to describe the relationship between trips between stations.

b<-log(trip_count+1) #use log value to plot heatmap to view the relationship between stations
heatmap.2(b, Rowv=FALSE, symm=TRUE, margin=c(6, 6), trace="none")
## Warning in heatmap.2(b, Rowv = FALSE, symm = TRUE, margin = c(6, 6), trace
## = "none"): Discrepancy: Rowv is FALSE, while dendrogram is `both'. Omitting
## row dendogram.
## Warning in heatmap.2(b, Rowv = FALSE, symm = TRUE, margin = c(6, 6),
## trace = "none"): Discrepancy: Colv is FALSE, while dendrogram is `column'.
## Omitting column dendogram.

From the heatmap, we can observe a trip pattern where trips are commonly seen to be intra-city instead of inter-city. This indicates to us that it is more useful to look at the data city by city instead of as a whole when we study the bike problem.

Next, our team calculated the number of trips starting and ending from each station. This allows us to calculate the in-degree and out-degree centrality of each node (station) in this network graph.

in_centrality<-as.data.frame(table(tripdata$start_station_name))
colnames(in_centrality)<-c('name','In_Degree_Centrality')
out_centrality<-as.data.frame(table(tripdata$end_station_name))
colnames(out_centrality)<-c('name','Out_Degree_Centrality')

After that, the difference of [out degree] - [in degree] is calculated. By dividing the average of the two degree, we could get an index the severtiy of bike excess (diff<0) or bike shortage (diff>0).

station_info<-full_join(in_centrality,out_centrality)
## Joining, by = "name"
station_info<-full_join(stationdata,station_info)
## Joining, by = "name"
## Warning: Column `name` joining character vector and factor, coercing into
## character vector

Finally, wecalculate the different between bikes in and out from each station and label the areas as facing a shortage or excess respectively.

station_info$Diff<-station_info$In_Degree_Centrality - station_info$Out_Degree_Centrality
station_info$Avg<-(station_info$In_Degree_Centrality + station_info$Out_Degree_Centrality) /2
station_info$index<- station_info$Diff / station_info$Avg
station_info$State<-ifelse(station_info$index>0,'Excess','Shortage')

3.5 Map of Excess and Shortage

if (!(file.exists("Data_station_info.csv"))) {
  write.csv(station_info, file = "Data_station_info.csv", row.names = FALSE)
}

ggmap(bay_area_map)+
  geom_point(data=station_info,aes(x=long,y=lat,color=State,size=abs(index)),alpha=0.5)+
  scale_size_continuous(guide = FALSE)+
  ggtitle('Bike Stations in Bay Area ')+
  theme_map()

pal <- colorFactor(c("blue", "red"), domain = c("Shortage", "Excess"),levels=c('Shortage','Excess'))
leaflet(station_info) %>% addTiles() %>%
  addCircleMarkers(lng=~long,lat=~lat,
                   radius = ~abs(index)*20,
                   color = ~pal(State), popup = ~name,
                   stroke = FALSE, fillOpacity = 0.5
  )
if (!(file.exists("Data_cleaned_trip.csv"))) {
  write.csv(tripdata, file = "Data_cleaned_trip.csv", row.names = FALSE)
}

Plotting these areas of excess and shortage on the map of San Francisco, We can then see that stations in SF are mostly facing shortage issue as average bikes returning is less than bikes needed.

Next, we will work with the cleaned “Status9” data to view bike and dock availability across stations. To do this, we first began with San Jose Caltrain Station.

test_set<-Status9[1:5,]
leaflet() %>%  addTiles() %>%
  addMarkers(data = test_set, lng = ~long, lat = ~lat,
             popup = ~name)
PlotData<-Status9[(Status9$Date=='2014-05-02' & Status9$Hour=='12'& Status9$name=='San Jose Diridon Caltrain Station'),c(5,6,7,8,10,11) ]

In addition to the map locations, additional information such as the number of bikes an docks available at the station can be found. This information will be valuable for customers as it allows them to go to stations where there is a supply of bicycles for use. This also allows them to choose the specific station and time they want information on.

#check bike availability
popup<-paste(sep='<br/>',
             PlotData$name, paste(sep=' ','Bike available:',PlotData$Average_Bikes),
             paste(sep=' ','Docks available:',PlotData$Average_Docks),
             paste(sep=' ','Total Docks here:',PlotData$dock_count)
)

leaflet() %>%  addTiles() %>%
  addMarkers(data=PlotData,lng = ~long, lat = ~lat,popup=popup
             ) #customer can view bike and dock availability from the map
#
San_Jose_Data<-Status9[(Status9$Date=='2014-05-02' & Status9$Hour=='12'& Status9$city=='San Jose'),c(5,6,7,8,10,11) ]
popup1<-paste(sep='<br/>',
              San_Jose_Data$name, paste(sep=' ','Bike available:',San_Jose_Data$Average_Bikes),
             paste(sep=' ','Docks available:',San_Jose_Data$Average_Docks),
             paste(sep=' ','Total Docks here:',San_Jose_Data$dock_count)
)
leaflet() %>%  addTiles() %>%
  addMarkers(data=San_Jose_Data,lng = ~long, lat = ~lat,popup=~popup1
  ) #plot all stations status at San Jose on 2014-05-02, and at 12 pm. 

4. Graphical Analysis

In additional to geo-spatial analysis, our app will include various graphs that allow customers and the company to understand key distributions within the data.

4.1 Distribution of trip duration

The first graphical analysis we conducted was to understand the distribution of trip durations across customers.

#Density plot of trip duration,for duration wintin 60 minutes 
ggplot(tripdata, aes(x = duration)) + 
  geom_density() + 
  xlim(0,60) +
  ggtitle("Density plot of trip duration")
## Warning: Removed 19808 rows containing non-finite values (stat_density).

#Density plots of trip duration by subscription type
subscription_trip <- tripdata[,c(2,10)]
ggplot(subscription_trip, aes(x = duration, fill = subscription_type)) + 
  geom_density(alpha=0.5) + 
  xlim(0,60) +
  ggtitle("Density plots of trip duration by subscription type")
## Warning: Removed 19808 rows containing non-finite values (stat_density).

We observed that a large proportion of trips last within 3 to 15minutes. We then distinguished the duration between the subscription type of customers and found out that guest customers have a much wider spread of trip durations, with more opting for longer rides as compared to the shorter and more consistent rides of subscribers.

Our team concluded that subscribers of bike sharing services probably use these services to get to-and-from a destination and thus carry out the same trip multiple times. Guest customers however, may have different purposes of renting the bikes such and may use the bikes for extended periods of time for sightseeing or exercise.

Next, our team conducted graphical analysis by filtering the data according to months, days and hours. To do this, we first created new columns to represent the data, month, hour and minute into trips data. This data is taken from the starting time of the trip

4.2 Adding date, month, hour, an minute into trips data according (start time)

date.time<-t(as.data.frame(strsplit(tripdata$start_date,' ')))
row.names(date.time) = NULL

tripdata$date<-as.Date(date.time[,1],"%m/%d/%Y")
tripdata$time<-date.time[,2]

tripdata$Month<-format(tripdata$date,'%m')
tripdata$Month<-as.numeric(tripdata$Month)
mymonths <- c("January","February","March",
              "April","May","June",
              "July","August","September",
              "October","November","December")
tripdata$Month <- mymonths[tripdata$Month]
tripdata$Month = factor(tripdata$Month, levels = month.name)

tripdata$Day <- weekdays(tripdata$date)
tripdata$Day<-factor(tripdata$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))

tripdata$hour<-ifelse(nchar(tripdata$time)==5,substr(tripdata$time,1,2),substr(tripdata$time,1,1))
tripdata$hour<-as.numeric(tripdata$hour)

4.3 Distribution of trips by month

month_counts <- count(tripdata, vars = Month)
colnames(month_counts) <- c("Month","Number_of_Rides")
month_counts
## # A tibble: 12 x 2
##    Month     Number_of_Rides
##    <fct>               <int>
##  1 January             51036
##  2 February            44292
##  3 March               53779
##  4 April               55051
##  5 May                 55416
##  6 June                59164
##  7 July                60782
##  8 August              62422
##  9 September           55447
## 10 October             61799
## 11 November            48545
## 12 December            38644
ggplot(month_counts, aes(x=Month,y=Number_of_Rides)) + 
  geom_bar(stat="identity", fill="red") + ggtitle("Plot of number of rides by Month") + 
  ylab("Number of Rides")

From this graph, we observed that demand for bikes will generally increase from March to October, except a fall in September. November, December, January and February are also identified to have months with less customers.

4.4 Plotting variation in number of rides by month across the different cities

#plotting by month and start city
StationData$count<-rownames(StationData)
start.city<-c()
for(i in 1:length(tripdata$id)){
  k<-which(tripdata$start_station_name[i] == unique(StationData$name))
  start.city[i]<-StationData$city[StationData$count==k]
  
}
tripdata$start.city<-start.city

end.city<-c()
for(i in 1:length(tripdata$id)){
  k<-which(tripdata$end_station_name[i] == unique(StationData$name))
  end.city[i]<-StationData$city[StationData$count==k]
  
}
tripdata$end.city<-end.city

tmp <- tapply(tripdata$id, INDEX = list(tripdata$Month, tripdata$start.city), length)
tmp <- as.data.frame(tmp) #dataframe coercion
tmp$Month = rownames(tmp) #Creating year column as the rownames
tmp
##           Mountain View Palo Alto Redwood City San Francisco San Jose
## January            1255       485          220         46252     2824
## February           1231       339          203         40010     2509
## March              1611       499          279         48390     3000
## April              1633       501          292         49583     3042
## May                1729       583          275         49325     3504
## June               1944       652          324         52498     3746
## July               2103       820          396         53874     3589
## August             2012       713          352         55757     3588
## September          1246       609          288         49811     3493
## October            1362       662          286         55760     3729
## November           1095       505          258         43932     2755
## December            862       315          106         35414     1947
##               Month
## January     January
## February   February
## March         March
## April         April
## May             May
## June           June
## July           July
## August       August
## September September
## October     October
## November   November
## December   December
stats <- melt(tmp, id.vars = "Month", variable.name="series") 
stats$Month<-factor(stats$Month, levels = month.name)

ggplot(stats, aes(x=Month,y=value)) + 
  geom_bar(stat="identity", aes(fill = series), position="fill") + 
  ylab("Proportion of City") + 
  scale_fill_manual("legend", values = c("Mountain View" = "black", "Palo Alto" = "brown", "Redwood City" = "red", "San Francisco"= "orange", "San Jose" = "yellow"))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

When the data is split across the different cities, we find that th percentage of bikes from each city is relatively constant. Although the number of trips changes across cities each month, the proportion of bikers are roughly the same from each city.

4.5 Distribution of trips by day

#Counting trips by day
day_counts <- count(tripdata, vars = Day)
colnames(day_counts) <- c("Day","Number_of_Rides")
day_counts$Day<-factor(day_counts$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))

#Plotting number of riders by day
ggplot(day_counts, aes(x=Day,y=Number_of_Rides)) + geom_bar(stat="identity", fill="blue") + ggtitle("Plot of number of rides by Day") + ylab("Number of Rides")

By analysing the number of trips by day, we observe that overall there are less rides on weekends (Saturday & Sunday).This could indicate that a majority of bike-share users use these bicycles to commute to work or school on weekdays.

4.6 Plotting variation in number of rides by day across the different cities

tmp2 <- tapply(tripdata$id, INDEX = list(tripdata$Day, tripdata$start.city), length)
tmp2 <- as.data.frame(tmp2) #dataframe coercion
tmp2$Day = rownames(tmp2) #Creating year column as the rownames
tmp2
##           Mountain View Palo Alto Redwood City San Francisco San Jose
## Monday             3453      1006          544        100965     6120
## Tuesday            3507      1028          566        106588     6452
## Wednesday          3311       991          633        104739     6494
## Thursday           3211       936          593        103642     6665
## Friday             2686       987          538         95051     6140
## Saturday            863       822          221         37905     3054
## Sunday             1052       913          184         31716     2801
##                 Day
## Monday       Monday
## Tuesday     Tuesday
## Wednesday Wednesday
## Thursday   Thursday
## Friday       Friday
## Saturday   Saturday
## Sunday       Sunday
stats2 <- melt(tmp2, id.vars = "Day", variable.name="series") 
stats2$Day<-factor(stats2$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))

ggplot(stats2, aes(x=Day,y=value)) + 
  geom_bar(stat="identity", aes(fill = series), position="fill") + 
  ylab("Proportion of City") + 
  scale_fill_manual("legend", values = c("Mountain View" = "black", "Palo Alto" = "brown", "Redwood City" = "red", "San Francisco"= "orange", "San Jose" = "yellow"))

When we plot to observe variations in proportion of ride across cities, we see that San Francisco experiences a decrease in bike trips on weekends, contributing largely to the decrease in riders on weekends. On the other hand, Palo Alto and San Jose experience higher proportion bike rides on weekends.

4.7 Distribution of trips by hour

hour_counts <- count(tripdata, vars = hour)
colnames(hour_counts) <- c("Hour","Number_of_Rides")

hour_counts$Hour<-factor(hour_counts$Hour,levels = seq(0,24))
#Plotting number of riders by day
ggplot(hour_counts, aes(x=Hour,y=Number_of_Rides)) + geom_bar(stat="identity", fill="green") + ggtitle("Plot of number of rides by Day") + ylab("Number of Rides")

By analysing the number of trips by hour, we observe that the largest concentration of rides occur in the morning from 7-9am and in the evening from 4-7pm. This reinforces our team’s observation that commuters use these bikes to get to-and-from work.

4.8 Plotting variation in number of rides by hour across the different cities

tmp3 <- tapply(tripdata$id, INDEX = list(tripdata$hour, tripdata$start.city), length)
tmp3 <- as.data.frame(tmp3) #dataframe coercion
tmp3$Hour = rownames(tmp3) #Creating year column as the rownames
tmp3$Hour<-as.factor(tmp3$Hour)
tmp3
##    Mountain View Palo Alto Redwood City San Francisco San Jose Hour
## 0             19        23           12          1729      287    0
## 1             43        31            6           812      222    1
## 2              8        12            6           554       55    2
## 3              4        11           NA           291       15    3
## 4              3         2            2           974       25    4
## 5            109         5            5          2962      303    5
## 6            428        61           17         12301     1088    6
## 7           1489       200          364         37531     3653    7
## 8           2187       617          401         77101     3151    8
## 9           2549       636          299         55109     2003    9
## 10           733       392          117         26236     1407   10
## 11           513       357          117         25192     1687   11
## 12           525       403          220         29486     2209   12
## 13           452       358          158         27332     1944   13
## 14           415       410          102         23315     1484   14
## 15           601       398          175         29006     1601   15
## 16          1730       537          307         50849     3660   16
## 17          2375       733          339         72106     4721   17
## 18          1961       630          272         49803     3205   18
## 19          1003       422          161         24700     1848   19
## 20           391       195           80         14157     1072   20
## 21           295       133           45          9472      844   21
## 22           169        85           47          6046      739   22
## 23            81        32           27          3542      503   23
stats3 <- melt(tmp3, id.vars = "Hour", variable.name="series") 
stats3$Hour<-factor(stats3$Hour,levels = seq(0,24))
#plotting variation in number of rides by day across the different cities
ggplot(stats3, aes(x=Hour,y=value)) + geom_bar(stat="identity", aes(fill = series), position="fill") + ylab("Proportion of City") + scale_fill_manual("legend", values = c("Mountain View" = "black", "Palo Alto" = "brown", "Redwood City" = "red", "San Francisco"= "orange", "San Jose" = "yellow"))
## Warning: Removed 1 rows containing missing values (position_stack).

When we plot to observe variations in proportion of ride across cities, we see that the proportion of number of trips remains even across all five cities. Although there are some variations in trips from 2200hrs to 0500hrs, the number of total trips at these timings are very low thus there will be larger variations in proportion at these timings.

5. Shiny App

Our team aims to integrate the functions of statistical, geo-spatial and graphical analysis into a single application on bike-sharing services. This app will provide an interactive interface for our users to view the different data visualizations available and also how the different parameters (Date, Time, City, Station) affect the visualization.

5.1 Strategies employed to improved bike-sharing services

To solve the various problems the bike-sharing industry is facing, our team has come up with two suggestions for improvement that these companies could adopt: Dynamic Pricing and Bicycle Redistribution.

5.1.1 Dynamic Pricing

The first strategy to implement is dynamic pricing. According to this payment policy, the price of renting a shared bike will change throughout the day. This price will be dependent on the demand (Rate of bikes getting rented) and the availability of bikes at both the starting station and ending station.

For example, getting a bike from a high demand, low availability station will cost users more as compared to getting one from a low demand, high availability one. Similarly, returning the bike to a high demand, low availability station will cost users less as compared to returning one to a low demand, high availability one.

By adopting this pricing strategy, users will be encouraged to pick up their bikes from stations with greater number of bikes and return them at stations with lesser number of bikes. This will help to resolve the issue of an uneven distribution of bikes and also the issue of shortage of bikes at selected stations.

Here is an example of dynamic pricing, taken from Uber: image

5.1.2 Redistribution of supply of bikes

The second strategy targets the problem of excess and shortages of bicycles in stations directly.

As seen from the ggmaps from our geo-spatial analysis, some areas observe an excess in the number of bikes available and likewise, other areas observe a shortage in the number of bikes available. One easy way to resolve the uneven distribution of bikes is to redistribute the supply of bikes according to the data.

Bikes at stations with an excess number of bikes can be relocated to stations that observe a shortage in the number of bikes available. This strategy is also relatively easier to implement in Singapore as the bikes are not required to put park in docks at the stations, and would only need to be parked in the designated geo-zone.

6. Conclusion

Through this project, our team has analysed many aspects of the industry of bike-sharing services. We have conducted statistical, geo-spatial and graphical analysis on bike-sharing data and provided improvements to current industry practices. These analysis will be packaged into a application that will help both bike-sharing companies and their consumers alike.

By adopting these recommendations, we hope that the issues of indiscriminate parking and the shortage and excess of bicycles in stations will be reduced.